
Theoretical and empirical studies [21,77,9,55,49,5,12]postulate that such flatter regions generalize better than sharper minima, e.g., due to the flat minimizer's robustness against loss function shifts between trainandtestdata,asillustrated inFig.1.